Adaptive video telephone system

ABSTRACT

A videophone system providing high resolution video transmission between videophones utilizes compressed video signals and audio signals which may be transmitted through any communications network, with the system providing real time adaptive error recovery and synchronization between the audio and video signals to produce high quality video images and sound. The system is highly resilient, adapting in real time to changing conditions in a network or to network errors due to data corruption or loss that can be produced, for example, by noise or line losses, thereby substantially eliminating jitter, signal packet loss or delay, or other errors which produce signal degradation.

This application is a national stage application, filed under 35 U.S.C. § 371, of International Application No. PCT/US2005/014554 with an international filing date of Apr. 27, 2005, and claims priority to U.S. Provisional Application No. 60/566,410, filed Apr. 30, 2004, each application of which is incorporated by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates, in general, to video telephone systems for transmitting compressed video data over a network, and more particularly to adaptive systems for real time network error recovery for high quality video telephone communications.

BACKGROUND OF THE INVENTION

Traditionally, telephone and video communication systems have been bifurcated. Conventional telephone systems (or PSTN systems) operate at a bandwidth appropriate for voice communications, and typically provide spontaneous, point-to-point communications, such as two-way voice and data services, between two end users. In contrast, video distribution networks (including cable television systems) operate at a much broader bandwidth than telephone systems, and are usually employed to transmit pre-determined, high quality, full-motion video and audio concurrently to a plurality of subscribers.

It has long been felt that if the best features of voice and video communication systems could be combined appropriately, fully interactive video telephony would become feasible, and accordingly video telephony has been the subject of commercial development for many years. Although the first videophone appeared as early as the 1930s, a commercially viable videophone has yet to be introduced, even though significant efforts have been devoted to developing the same. This has been due, in large part, to the relatively high cost of videophones, their complexity both in design and use, their inability to concurrently provide quality image and sound, and their inability to provide a network infrastructure capable of two-way communications with minimal signal degradation.

Prior attempts at video telephony typically have resembled traditional business telephone desk sets with the addition of a display monitor and a camera, together with associated controls for operating the videophone. The cost of such devices has typically been in excess of $1000, which is above the level of affordability for many users, and this cost is compounded since at least two videophones are needed to make a video call. Furthermore, these devices are often relatively large, and not portable.

The quality of the image and sound in such prior videophones is typically substantially less than what is expected by most people for normal communications. Only a minimal capability, if any, is provided for accommodating different ambient conditions, or different audio characteristics (e.g., canceling ambient noise and feedback within the audio signal, accommodating concurrent conversations by both parties to the call). Furthermore, the signal processing utilized for such devices, including the techniques used for compressing and decompressing the resulting audio and video signals, has not been optimized, with the result that the quality of both the transmitted and received video is much less than what is expected from a communications system. For example, varying ambient light conditions often result in over-exposed or under-exposed pictures. Movement of the user often results in both a significant degradation in image quality and the possibility that the camera can no longer capture the image of the user (e.g., outside of the limited range of view of the camera).

Because of the complexity of prior systems, there is a complicated set-up process to configure the videophone to the particular communications network being utilized. Even videophones that can work with multiple types of communications networks are far from “plug ‘n’ play” with any network. In addition, the videophone must be located where it can be directly connected to the available communication network via an Ethernet or comparable connection, severely limiting flexibility in locating and using the videophone. Since a videophone typically uses traditional IP addressing, a user must enter a number sequence that is different from what people are accustomed to as a standard phone number. Furthermore, there typically is no provision for telephone services and applications such as caller ID, call waiting, call forwarding, conferencing and the like.

Videophones are expected to work across long distances which encompass multiple networks and network infrastructures. Delays in transmissions and the presence of noise degrade the signal quality. Even though prior videophones have advertised high frame rates and transmission speeds, they do not typically achieve these speeds due to the upstream and downstream characteristics of communications networks or due to lossy networks which cause data to be corrupted or lost during transmission. This results in degraded images and sound quality, jitter, lack of synchronicity between the voice and video, etc.

In prior systems, attempts have been made to overcome degraded images, loss of synchronism between audio and video, jitter and delay through the use of feedback systems which detect errors such as missing data packets and request retransmission by the source. Such error recovery requires buffers for temporary storage of received signals, and produces delays in any communication between videophones. This lack of real time communication is unacceptable in videophone systems.

SUMMARY OF THE INVENTION

In accordance with the present invention, a videophone system is provided in which high resolution video transmission between videophones is achieved. Compressed video signals and audio signals may be transmitted through any communications network, with the system providing real time adaptive error recovery and synchronization between the audio and video signals to produce high quality video images and sound. The system is highly resilient, adapting in real time to changing conditions in a network or to network errors due to data corruption or loss that can be produced, for example, by noise or line losses, thereby substantially eliminating jitter, signal packet loss or delay, or other errors which produce signal degradation.

A videophone system may consist of a video source at one end and a video display at the other, connected by a medium that may be lossy, and may operate using an encoding technique in which data representing current video may depend on earlier pictures in the video. An example of such a system would be a pair of videophones, each with a camera and a display that shows video from the other phone, encoded using the standard video compression technique known as H.264, but connected by an unreliable network. The present invention provides ways to increase the quality and reliability of such a system.

An encoding scheme such as H.264 uses concepts such as I-slices, which are slices of a frame, or picture, encoded without being based on information in any picture other than the current one; P-slices, which are slices based on one or more previous “reference” slices, that is, encoded frames used to encode future frames; SI-slices, which are I-slices that encode the same resultant data as a particular P-slice or SP-slice; and SP-slices, which are P-slices that encode the same resultant data as a particular P-slice or SP-slice by using a different set of reference slices. A frame will consist of one or more slices.

A reference slice is an encoded picture, or section of a picture, that is used as a reference for encoding future pictures. The data from a decoded reference slice is required to correctly decode one or more future pictures. If a reference slice is lost or damaged, then one or more future pictures cannot be correctly decoded.

A lossy network is one that may cause data to be corrupted or lost during transmission. Such a lossy network can cause disruption of a video stream by causing a reference slice to be lost or corrupted so that following pictures are impossible to correctly decode. Methods for reducing the severity and impact of a lossy network are part of the present invention.

Recording video transmissions, such as for saving messages, may require the ability to move around within a received stream at playback time, to provide features such as fast-forward, reverse, and skip. An original received stream may not have the features that make this possible, because including this information in the original transmission would reduce the picture quality or increase bandwidth usage, so a method for adding that information after reception is useful.

When a video signal is being recorded at the receiving end (such as for recording a video message), the received video can be processed immediately or later to add data that allows for fast-forward, rewind, and jumping to a point within the video stream. An example would be to take a stream of I-slices or P-slices and either add SI and/or SP slices into the recorded stream, or create a second stream from which it is possible to switch into or back from the main stream.

Accordingly, the present invention is directed to a method and system for creating and processing compressed video streams and dealing with possible errors in the streams. Data within the stream may be protected against errors in an adaptive manner based, for example, on how important the data is, and such protection may be provided by duplication or other methods. Methods utilizing techniques such as thresholds can be used to select which pictures are used as reference slices to minimize the impact of protecting them from error.

In order to maximize the quality of the video, it is important to know which portions of the image are the most important to the viewer, and to weight those parts of the image to maximize the perceived quality of the picture. The areas that are important can be determined by using long-term motion and change information, or by using object recognition for objects such as faces. This allows more bits to be used for the areas that are more important to the viewer and fewer bits for other areas of the picture, even if there is motion or change in those less important areas in a particular picture.

If desired, the amount of motion in different areas of the pictures of the video stream may be used to weight the encoding process, such that more data is used to encode areas that have long-term histories of motion, especially in cases such as a “talking head” where the camera and subject are mostly in fixed positions. The areas to be weighted may be selected by using object recognition, such as by using face feature recognition to identify where on the picture a “talking head” is and where certain facial features are, such as the mouth and eyes. Data from previous transmissions can be stored to provide an initial set of weights or other predictive data for encoding the video.

If desired, reference slices from previous calls from the same source can be saved by the receiver so that on the next transmission the encoder can use them at the start of the transmission instead of having to start with an I-slice, and so improve the video quality or reduce the bit rate. The initial set of reference slices may be standardized so that any transmission can make use of them to improve the video quality or reduce the bit rate.

In accordance with the invention, high-resolution video transmission is obtained by sending multiple copies of reference slices or otherwise providing methods for recovering reference slices, such as forward error correction, especially in the case where not every slice is a reference slice. In this way the likelihood that the reference slice will be unrecoverable is reduced. Any number of multiple copies can be sent, and the number of non-reference slices between reference slices can be any number including 0, and may vary in the video stream. The number of copies of reference slices and/or the distance (or time) between reference slices are determined statically or dynamically by either the expected or measured error rates.

A video image is typically encoded with reference slices, and the selection of which slices are reference slices is done by using the amount of change in the video from the previous reference slice or slices, especially where a small number of reference slices is used, instead of selecting reference frames without regard to the content. This selection may be based on an adaptable threshold, so that if there is little change in the video over longer periods, the threshold for selecting a new reference slice would increase, and the threshold would decrease if the amount of change increased. The threshold value for creating a reference slice is held above a minimum based on how much extra bandwidth will be used by a reference slice versus how much adding a reference slice will reduce the number of bits to encode the video at the same quality level.
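By way of non-limiting illustration, the adaptable threshold described above might be sketched as follows. The class name `ReferenceSelector`, the scalar `change` measure, and all numeric defaults are assumptions introduced for this example only; the description does not specify how the amount of change is measured or how the minimum is computed.

```python
from dataclasses import dataclass


@dataclass
class ReferenceSelector:
    """Picks reference slices from the amount of change since the last one.

    Every numeric default here is a hypothetical tuning value.
    """
    threshold: float = 10.0     # current change threshold
    min_threshold: float = 4.0  # floor from the bandwidth cost/benefit trade-off
    step: float = 1.0           # how far the threshold moves per frame

    def is_reference(self, change: float) -> bool:
        """Return True if the current frame should become a reference slice."""
        if change >= self.threshold:
            # Significant change: select a reference and lower the
            # threshold, but never below the minimum.
            self.threshold = max(self.min_threshold, self.threshold - self.step)
            return True
        # Little change: raise the threshold so references become rarer.
        self.threshold += self.step
        return False
```

In this sketch a quiet scene steadily raises the threshold, while sustained activity lowers it until it reaches the floor.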

In order to adapt the encoded video stream to overcome errors caused in the network, a receiver is programmed to record the size and arrival time of each incoming packet, the time it was sent, and any apparent delay of the packet. If the accumulated data indicates that an error is occurring in the network, such as a delay caused by a network bandwidth that is too low, the receiver sends a message to the sender, telling it to reduce the bandwidth it is using, as by reducing the bit rate of the encoded data packets.

When the sender receives a message to reduce bandwidth, it responds by changing the data stream appropriately, as by setting a new data bit rate. The receiver allows the sender to process the request before it resumes measurements of the received signals, and the process is repeated. The system may be programmed to increase the bit rate by a percentage of the amount it was lowered if it has operated successfully for a set period of time at the lower bit rate.

The system of the invention is extremely resilient, in that it is capable of responding rapidly and in real time to the loss of data from the video stream. This resiliency is provided by programming the receiver to record whether or not each data packet that is transmitted is a reference slice. The receiver then responds to the received video stream to signal whether each packet was received or not received, and also records the status of that packet, updates the running estimate of incoming packet loss, and determines whether a lost packet contained a reference slice. Based on the running estimate of incoming packet losses, the sender is adjusted to send one of several possible mixes of reference slices and non-reference slices, varying from all reference slices to multiple non-reference slices between each reference slice, and/or to change the rate at which frames are sent. The mix of slices is changed dynamically as network conditions change, so that the resilience of the system is increased or decreased to tune the system as needed to produce the highest possible video quality.

Although the invention is described in terms of a videophone system, which represents its preferred embodiment, it will be understood that the methods, techniques and processes described herein are applicable to other similar communications systems.

BRIEF DESCRIPTION OF DRAWINGS

The foregoing, and additional objects, features and advantages of the present invention will be better understood from the following detailed description of preferred embodiments thereof, taken with the accompanying drawings, in which:

FIG. 1 is a block diagram of a videophone transmission system;

FIG. 2 is a block diagram of the data encoding process of the invention;

FIG. 3 is a diagrammatic illustration of a first image encoding technique in accordance with the invention;

FIG. 4 is a diagrammatic illustration of a second encoding technique for recovery of recorded image signals;

FIG. 5 is a diagrammatic illustration of a decoding process for data storage and playback;

FIG. 6 is a more detailed diagrammatic illustration of the videophone communication system of FIG. 1;

FIG. 7 is a diagrammatic illustration of a videophone display in accordance with the invention;

FIGS. 8-10 diagrammatically illustrate the adaptive receiver and transmitter-side process of the invention; and

FIG. 11 diagrammatically illustrates the error resilience process of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Turning now to a more detailed description of the invention, there is illustrated in FIG. 1 in block diagram form a simplified version of a videophone system in which a first videophone 10 is activated to send a video and audio signal to a receiver 12 by way of a network 14, which may incorporate any desired communications path such as cable, a wireless network, or the like. The receiver 12 preferably is also a videophone capable of transmitting video and audio signals to be received by videophone 10. Any number of videophones may be connected to the network, as is well known.

It is expected that network 14 usually will not be ideal, and will introduce delays or noise, or will otherwise corrupt data being transmitted, resulting in degraded images and loss of synchronization between transmitted audio and video signals. A primary feature of the present invention is the provision of real time correction of corrupted or lost data to produce high-resolution video images synchronized with the accompanying audio signals to produce enhanced videophone communication.

Video transmission systems typically use standard video compression techniques following one of the known standards identified as MPEG-4 AVC; MPEG-4 part 10; H.264; and H.26L. The videophone system of the present invention preferably utilizes the H.264 standard, and accordingly the following description will be based on that standard.

The H.264 standard for video compression uses previous image frames to encode current data, with subsequent frames encoding the differences between succeeding images. As illustrated in FIG. 2, to start a sequence a first frame A is selected at sender 10, as indicated at 16, and is encoded, as at 18. The encoded data is transmitted by way of network 14 to a decoder 20 in receiver 12, where frame A, or a close approximation of frame A, is reproduced at 22 for display or for recording, or both. The encoding of the image represented by frame A may be carried out by digitally quantizing the image A to produce a frame of a given resolution, such as 176×144 pixels. The frame is divided into a series of macroblocks, each of which is typically 16 pixels × 16 pixels, which are each encoded using the H.264 baseline profile standard. The encoder 18 compresses all of the macroblocks for the first frame to produce a data packet. The first data packet in a stream is typically encoded as an I-slice, which does not reference any previous frame data, and is indicated at 24 in FIG. 3. The I-slice data packet is transmitted, decoded at the receiver, and is used in a display raster scan to reproduce frame A.
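As a small worked example of the frame division just described, the following sketch counts the macroblocks in a frame. The function name `macroblock_grid` is hypothetical, and rounding partial blocks up to a whole macroblock is an assumption of this example rather than something stated in the description.

```python
def macroblock_grid(width: int, height: int, mb_size: int = 16):
    """Return (columns, rows, total) of macroblocks covering a frame.

    Partial blocks at the right or bottom edge are rounded up to a full
    macroblock (an assumption of this sketch).
    """
    cols = -(-width // mb_size)   # ceiling division
    rows = -(-height // mb_size)
    return cols, rows, cols * rows


# The 176x144 frame mentioned above divides evenly into 11 x 9 macroblocks.
print(macroblock_grid(176, 144))  # (11, 9, 99)
```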

Subsequent frames are encoded, but only the differences between a preceding frame and the next succeeding frame are encoded, to reduce the amount of data that needs to be transmitted. The subsequent encoded frames are data packets which are referred to as predictive slices of the image, and are illustrated as P-slices P₁-P_N at 24, 26, 28, 30, 32 and 34 in FIG. 3. Under the H.264 standard, typically each previous P-slice is used to encode the current frame to produce a P-slice for transmission. Both the encoder and the decoder store the decoded data for each frame.

Since the coding of a P-slice P_N depends on one or more previous frames, if one of the data packets is corrupted by the lossy network 14, subsequent data packets cannot be decoded without error. In prior systems, if a packet is lost or corrupted, the receiver 12 would send a feedback signal to the sender 10 to request a recovery transmission, which typically was done using an I-slice. Transmission of this recovery I-slice would typically use far more bits than a normal P-slice and thus would cause delay and/or increased bandwidth usage. In accordance with the present invention, such delays and/or increased bandwidth usage are avoided by sending the refresh via a P-slice based on previously known, correctly decoded frames. Thus, for example, if packets P₂ and P₃ are corrupted by noise in the network 14, packet P₄ would be encoded on the basis of data contained in the data packet represented by P-slice P₁, or, in some cases, on the basis of the data packet represented by I-slice 24. This allows the refresh to be done without any additional bandwidth usage and/or delay since the recovery P-slice is typically of similar size to a normal P-slice.

In accordance with the invention, in order to make the stream more resilient to loss introduced by the lossy network 14, the particular slices chosen as reference slices can be varied. Loss of a non-reference slice does not require recovery. The number of slices that are reference slices may vary adaptively during a videophone communication session, with the frequency at which a reference slice is chosen being dependent on, and responsive to, problems such as the nature and frequency of disturbances on the network and the network bandwidth available. Thus, the fewer reference slices there are in a stream, the less likely it is that the system will need a recovery transmission. For example, the more errors that occur in the network, the less often reference slices are sent, and this can change continuously during a transmission to maintain image quality at the receiver. This reference frequency adjustment may be responsive to feedback signals from the receiver to the sender.

In another form of the invention, error recovery is enhanced by duplication of data packets such as the R, or reference slice packet 30, as illustrated in FIG. 4. In this embodiment, important data packets may be sent two or more times to make sure that the data is received accurately, and that subsequent P-slices 32, 34, 36, etc., are accurately encoded and decoded. The number of non-reference slices can be varied, as discussed above, before another reference slice 38 is generated and transmitted two or more times. Alternatively, instead of transmitting a reference slice two or more times, other forms of forward error correction could be used.
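The benefit of sending a reference slice more than once can be illustrated with a short calculation. The sketch below assumes that each copy is lost independently with the same probability, which is a simplification introduced here (losses on real networks are often bursty); the function name is hypothetical.

```python
def reference_loss_probability(packet_loss: float, copies: int) -> float:
    """Probability that every copy of a reference slice is lost.

    Assumes independent, identically distributed packet loss, which is an
    assumption of this sketch and not a property claimed for any network.
    """
    if not 0.0 <= packet_loss <= 1.0:
        raise ValueError("packet_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return packet_loss ** copies
```

Under that assumption, a 10% packet loss rate leaves a reference slice sent twice unrecoverable about 1% of the time.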

The foregoing technique of adaptive data transmission based on network conditions improves over prior systems, in which the selection of reference slices is predetermined. Thus, the dynamic adaptation of the sender to changing network conditions provides improved image quality.

The system of the present invention accommodates video messaging, wherein an incoming videophone call indicated by data packets 40-47 in FIG. 5 is stored at the receiver or at a data center for later playback. In the storage process, a secondary stream of data is created, incorporating SP₁, SP₂, etc., illustrated at 48 and 49, and SI, illustrated at 50 in FIG. 5, which allow the receiver to play back portions of the recorded message. The data packets SP₁, SP₂, etc., allow fast forwarding of the recorded message to start play at any P-slice data packet following the SP₁ data slice; for example, to allow playback to start at P₄. The reference data-switching packet SI allows rewinding the recorded message to the location of that packet, and starting the playback at data packet P₆, as illustrated. This transcoding of data packets enables the receiver videophone to record and then selectively play back a message from a sender videophone.

A more detailed example of a videophone system in accordance with the invention is illustrated at 60 in FIG. 6, to which reference is now made. The system includes first and second videophones 62 and 64, both capable of transmitting and receiving audio and video signals through a common network 66. The network may be any communications network, either public or private, that is capable of carrying compressed video and audio signals. The first videophone 62 incorporates a conventional video camera 68 and a conventional video display 70, while the second videophone 64 similarly incorporates a video camera 72 and a video display 74. The videophones also incorporate suitable audio transducers 76 and 78, respectively. Video signals from camera 68 corresponding to an image to be transmitted are encoded at encoder 80 in the manner discussed above, and are transmitted by a conventional transmitter (not shown) in the videophone through output line 82 to network 66. The data sent by videophone 62 is received from the network by videophone 64 by way of input line 84 to a decoder 86 in videophone 64. The decoder acts on the received signals in the manner described above to produce an output which is displayed on display 74 to recreate the image detected by camera 68. The decoder tracks the received data, acknowledges its receipt by way of line 90, network 66, and line 92, and provides feedback information to encoder 80 to permit adaptation of the transmitted reference data packet frequency to any network problems detected at the decoder, as discussed above.

In similar manner, the videophone 64 receives images from camera 72, encodes them at encoder 94 in the manner discussed above, and transmits the resulting data packets by way of output line 96 through network 66 and input line 98 to decoder 100 in videophone 62 to display the transmitted image on display 70. The decoder tracks the received data packets and sends a feedback acknowledgement signal via line 102, network 66, and line 104 to encoder 94 to permit adjustment of the encoder output in response to network problems.

The Real-time Transport Control Protocol (RTCP) standard is used to provide feedback data between the decoders and encoders. The encoders respond to the RTCP feedback and determine what changes in the encoding process are required. For example, the encoder may respond to packet loss by decreasing the frequency of reference data packets, or may respond to an error signal by immediately sending a new reference data packet. The encoder tracks which frames are reference slices, and when a feedback signal indicates that data has been lost, the encoder determines whether it needs to make a recovery transmission and, if so, whether it can recover via a P-slice or an I-slice. Each encoder is controlled to generate the desired data packet sequences, and the decoders respond to reproduce a high quality image. The decoder will incorporate sufficient memory to enable it to store the needed frames.

Audio signals received from the respective transducers 76 and 78 are transmitted between the videophones through the network 66 in conventional manner. The audio signals carry superimposed timing signals for synchronization with the video signals. In a preferred form of the invention, the video and audio data carry the same timing signals, and at the receiving end the video signals are slaved to the audio signals. If synchronization is lost, because of problems in the network, for example, video frames are either skipped or are repeated at the receiver to restore synchronization with the audio.
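One possible way to slave the video to the audio, offered only as a sketch, is to compare the timestamp of the next video frame with the current audio playback time and then skip, repeat, or show the frame. The function name `sync_action` and the 40 ms tolerance are assumptions of this example and are not taken from the description.

```python
def sync_action(video_ts: float, audio_ts: float, tolerance: float = 0.040) -> str:
    """Decide how to handle the next video frame relative to the audio clock.

    video_ts: timestamp of the next video frame, in seconds
    audio_ts: timestamp of the audio currently being played, in seconds
    tolerance: drift allowed before correcting (hypothetical value)
    """
    drift = video_ts - audio_ts
    if drift < -tolerance:
        return "skip"    # video lags the audio: drop the frame to catch up
    if drift > tolerance:
        return "repeat"  # video leads the audio: hold the current frame
    return "show"        # within tolerance: display normally
```

Because the audio clock is treated as the master, only the video is ever adjusted in this sketch.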

In a preferred form of the invention, as illustrated in FIG. 7, the video display 70 of videophone 62 not only displays images from videophone 64 as at receiver position 105 but in addition is connected as by way of line 106 to display the image 107 from camera 68 that is being transmitted by videophone 62. Similarly, camera 72 is connected by way of line 108 to display 74 so that the image detected by camera 72 is displayed at videophone 64. This allows the user of each videophone to monitor the image being transmitted.

If desired, a broadband modem, or server 110, can be connected to either or both videophones through the network 66 to supply data or images to a single videophone to permit the videophone to operate as a standalone receiver.

The adaptive mode of operation of a videophone receiver used in the present system is illustrated in the flow diagrams of FIGS. 8, 9 and 10, to which reference is now made. In this mode, a receiver (such as videophone 64) initially records the size and arrival time of each incoming data packet, the time it was sent (typically from the Real-time Transport Protocol (RTP) time stamp), the gap between it and the previous packet, and any apparent delay of the packet, as described in block 120 of FIG. 8.

The sequence of packet arrival times/sizes is detected, and the sequence is examined to determine whether some link the packet has traveled over has a lower bandwidth than the videophones are attempting to use. This is implied if packets are arriving increasingly late by an amount that is consistent with the sizes of the packets.

If enough data on arrival times has been accumulated (block 122 in FIG. 8), and the gaps between packet arrivals are on average greater than the gaps between the packets when they were sent, and a certain period of time has gone by since the last time an adjustment in the bit rate of the transmitted packets was requested, then the receiver tells the sender, via an RTCP “USER” message or other backchannel or feedback signal, to reduce the bandwidth used (block 124). The request for a lower bandwidth at this time is recorded, and arrival data is not accumulated for a defined period (such as ½ or 1 second), so that the sender has time to process the request to change the bandwidth (block 126).
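The receiver-side check of FIG. 8 might be sketched as follows. The class `ArrivalMonitor`, its parameters, and the 5% jitter margin are assumptions introduced for this example; the description leaves the amount of data required and the exact comparison as implementation choices. Only gaps between packets are compared, so the sketch does not require the sender and receiver clocks to agree.

```python
class ArrivalMonitor:
    """Receiver-side detection of a link slower than the sending rate."""

    def __init__(self, min_samples=30, holdoff=1.0, margin=0.05):
        self.min_samples = min_samples  # gaps required before deciding
        self.holdoff = holdoff          # seconds to ignore data after a request
        self.margin = margin            # tolerance for ordinary jitter
        self.last_request = float("-inf")
        self.prev = None                # (send_time, arrival_time) of last packet
        self.send_gaps = []
        self.arrival_gaps = []

    def on_packet(self, send_time, arrival_time):
        """Record one packet; return the implied rate fraction, or None."""
        if arrival_time - self.last_request < self.holdoff:
            # The sender is still adjusting: do not accumulate arrival data.
            self.prev = None
            return None
        if self.prev is not None:
            self.send_gaps.append(send_time - self.prev[0])
            self.arrival_gaps.append(arrival_time - self.prev[1])
        self.prev = (send_time, arrival_time)
        if len(self.send_gaps) < self.min_samples:
            return None  # not enough data yet (block 122)
        mean_send = sum(self.send_gaps) / len(self.send_gaps)
        mean_arrival = sum(self.arrival_gaps) / len(self.arrival_gaps)
        self.send_gaps.clear()
        self.arrival_gaps.clear()
        if mean_arrival > mean_send * (1.0 + self.margin):
            self.last_request = arrival_time  # record the request (block 126)
            return mean_send / mean_arrival   # fraction to report (block 124)
        return None
```

A non-None return value would be carried to the sender in the feedback message; how that message is encoded is outside the scope of this sketch.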

When a sender receives an RTCP “USER” message or other backchannel message (block 128 in FIG. 9) telling it to reduce bandwidth, the sender may use one or more of the following mechanisms (block 130) to reduce the bandwidth being used:

-   a) Increasing the compression ratio
-   b) Modifying the frame resolution
-   c) Reducing the packet rate
-   d) Renegotiating the media channels (video and/or audio) or using a previously negotiated alternative channel.

The amount by which the sender would change the bandwidth would include all overheads (such as IP and RTP overhead), and the sender sets the new bit rate to a certain percentage less than the bit rate that the arrival times implied. For example, if data packets are being sent every 33 ms, are of equal size, and are arriving every 44 ms on average (i.e., each packet is more delayed than the previous one), this would imply that some link the packet traveled over has a maximum bandwidth of roughly ¾ of the rate that is being sent. An RTCP “USER” packet specifying this arrival rate would be sent to the sender to process. The sender would then reduce its overall bit rate (including IP and RTP overhead) to somewhat less than ¾ of the previous bit rate.
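The arithmetic of this example can be sketched as below. The function name and the 0.9 safety factor (standing in for "a certain percentage less") are assumptions of this example, and the calculation presumes equal-size packets as in the example above.

```python
def new_target_bitrate(current_bps, send_interval, arrival_interval, safety=0.9):
    """Bit rate (including overhead) to use after a reduce-bandwidth request.

    send_interval / arrival_interval estimates the fraction of the current
    rate that the slowest link can carry, assuming equal-size packets.
    safety is a hypothetical factor that keeps the new rate below it.
    """
    if send_interval <= 0 or arrival_interval <= 0:
        raise ValueError("intervals must be positive")
    implied_fraction = min(1.0, send_interval / arrival_interval)
    return current_bps * implied_fraction * safety


# 33 ms sending gaps and 44 ms arrival gaps imply about 3/4 of the rate,
# so a 400 kbit/s stream is cut to roughly 270 kbit/s with safety=0.9.
print(round(new_target_bitrate(400_000, 0.033, 0.044)))  # 270000
```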

If the system successfully runs at a lowered bit rate for a period, such as 4 seconds (block 132 in FIG. 10), then the bandwidth is increased, usually by a percentage of the amount it was lowered (such as 25%) (block 134). Alternatively, the bandwidth used could be increased regardless of whether it is below the initial bandwidth.
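The recovery step of FIG. 10 can be illustrated with a one-line calculation. The function name is hypothetical, and the 25% default is the example figure given above.

```python
def raised_bitrate(original_bps: float, lowered_bps: float, fraction: float = 0.25) -> float:
    """Bit rate after a stable period: recover a fraction of the earlier cut."""
    return lowered_bps + fraction * (original_bps - lowered_bps)


# After a cut from 400 kbit/s to 300 kbit/s, one step recovers 25 kbit/s.
print(raised_bitrate(400_000, 300_000))  # 325000.0
```

Repeating this step after each stable period moves the rate back toward the original value in progressively smaller increments.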

The resilience of the present system in responding to errors is illustrated in the flow diagram of FIG. 11. As illustrated, as each packet is sent, the sender remembers whether it was a reference slice. It is then determined, via reports of acknowledgements (ACKs) and/or non-acknowledgements (NACKs) from the receiver, whether each packet that was sent is known to be received or determined to be lost (block 140 in FIG. 11). The sender records whether a packet was lost or received, updates the running estimate of incoming packet loss (block 142), and determines whether the lost packet contained a reference slice (block 144). If so, the sender updates its statistics (block 146) and determines whether a reference slice known to have been received is available (box 148). If not, a reference slice will be transmitted (box 150); if so, a P-slice will be transmitted (box 152).
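A minimal sketch of the sender-side bookkeeping of FIG. 11 is given below. The class `SenderLossTracker`, the use of an exponential moving average for the running loss estimate, and the choice of an I-slice as the fallback when no acknowledged reference is available are all assumptions of this example; the description states only that a reference slice is transmitted in that case.

```python
class SenderLossTracker:
    """Tracks ACK/NACK reports and chooses a recovery slice type."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha        # smoothing factor (hypothetical)
        self.loss_estimate = 0.0  # running estimate of packet loss
        self.sent = {}            # sequence number -> was it a reference slice?
        self.acked_refs = []      # reference slices known to have arrived

    def on_sent(self, seq, is_reference):
        """Remember whether each transmitted packet was a reference slice."""
        self.sent[seq] = is_reference

    def on_report(self, seq, received):
        """Process one ACK (received=True) or NACK; return 'P', 'I' or None."""
        is_ref = self.sent.pop(seq, False)
        sample = 0.0 if received else 1.0
        # Update the running loss estimate (block 142).
        self.loss_estimate += self.alpha * (sample - self.loss_estimate)
        if received:
            if is_ref:
                self.acked_refs.append(seq)
            return None
        if not is_ref:
            return None  # losing a non-reference slice needs no recovery
        # A reference slice was lost (block 144): recover from a reference
        # known to have been received if one exists, else start afresh.
        return "P" if self.acked_refs else "I"
```

For instance, a lost reference slice reported before any reference has been acknowledged yields "I", while the same loss after an acknowledged reference yields "P".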

If a certain number of milliseconds have gone by since the last change in error resilience (box 154), then the sender determines which resilience parameters are appropriate for this expected level of loss and the encoder's settings are changed appropriately (box 156).
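The per-packet bookkeeping of FIG. 11 can be sketched roughly as follows. This is an illustration under assumptions, not the patented implementation: the class `SenderState`, its method names, and the use of an exponential moving average for the running loss estimate are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class SenderState:
    alpha: float = 0.1          # smoothing factor for the loss estimate (assumed)
    loss_estimate: float = 0.0  # running estimate of packet loss, 0.0 to 1.0
    sent_is_reference: dict = field(default_factory=dict)  # seq -> bool
    acked_references: set = field(default_factory=set)     # refs known received

    def on_sent(self, seq, is_reference):
        # Remember whether each packet sent carried a reference slice.
        self.sent_is_reference[seq] = is_reference

    def on_report(self, seq, received):
        """Handle an ACK (received=True) or a NACK (received=False).

        Returns "reference" or "P" when a lost reference slice forces a
        decision about the next slice, otherwise None.
        """
        is_reference = self.sent_is_reference.pop(seq, False)
        # Block 142: update the running estimate of packet loss.
        sample = 0.0 if received else 1.0
        self.loss_estimate = (1 - self.alpha) * self.loss_estimate + self.alpha * sample
        if received:
            if is_reference:
                self.acked_references.add(seq)
            return None
        if not is_reference:
            return None
        # Blocks 144-152: a reference slice was lost. Send a P-slice if some
        # reference slice is known to have been received, else a reference slice.
        return "P" if self.acked_references else "reference"


state = SenderState()
state.on_sent(1, is_reference=True)
first_decision = state.on_report(1, received=False)
```

The periodic step of box 154 and box 156 would then read `loss_estimate` and select encoder settings for that level of loss.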

In one form of the invention, the following settings (modes) or resilience levels are available in the encoder:

-   -   1. Pure P-slice stream, all reference slices.
    -   2. Every 2nd P-slice is a reference slice.
    -   3. Every 3rd P-slice is a reference slice.
    -   4. Every 4th P-slice is a reference slice.
    -   5. Every 4th P-slice is a reference slice, and reference slices are sent multiple times.
    -   6. Every 5th P-slice is a reference slice, and reference slices are sent multiple times.
    -   7. Every Nth slice is an I-slice, the P-slices are not used for reference, and the frame rate is reduced to 15 frames per second.
    -   8. All I-slice mode, and the frame rate is reduced to 10 frames per second.

More modes could be provided, if desired, and the number of frames that are reference slices could be modified to produce the best subjective result.

The exact loss levels that are used to select different resilience levels are a tuning parameter for the system. For example, in one implementation of the invention, the loss levels may be less than 3%, 3%, 5%, 10%, 15%, 20%, 30%, and 50%, respectively, for the resilience modes 1-8 above. Alternatively, instead of fixed loss percentages to determine the resilience mode, the mode can be determined dynamically by counting how frequently and for how long the video at the receiver is incorrect because of a missing reference slice.
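The fixed-threshold selection can be sketched as a simple lookup. The sketch reads the example loss levels as lower bounds (mode 1 below 3% loss, mode 2 from 3%, and so on up to mode 8 from 50%); that reading, and the names `MODE_LOWER_BOUNDS` and `resilience_mode`, are assumptions.

```python
from bisect import bisect_right

# Loss fractions at which modes 2 through 8 begin (example values from the text).
MODE_LOWER_BOUNDS = [0.03, 0.05, 0.10, 0.15, 0.20, 0.30, 0.50]


def resilience_mode(loss_fraction):
    """Map an estimated packet-loss fraction (0.0 to 1.0) to a mode 1 to 8."""
    if not 0.0 <= loss_fraction <= 1.0:
        raise ValueError("loss_fraction must be between 0 and 1")
    # Count how many lower bounds the loss has reached; mode 1 means none.
    return 1 + bisect_right(MODE_LOWER_BOUNDS, loss_fraction)


modes = [resilience_mode(x) for x in (0.01, 0.03, 0.12, 0.60)]
```

Because the thresholds are tuning parameters, keeping them in one table makes them easy to adjust without touching the selection logic.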

If reference slices are missing from the received data more often than a threshold, such as one missing reference slice per second, then the system responds to increase the resilience level. If reference slices are missing less often than a threshold, such as one missing reference slice every 2 seconds, then the system responds to reduce the resilience level. Whether a slice is a reference slice may depend on the change of scene in the camera image, or the amount of motion from one frame to another (as opposed to a fixed frequency, such as every 3 frames). This feature may be combined with any of the 8 resiliency levels above.
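The two thresholds form a dead band between raising and lowering the level, which can be sketched as below. The function name `adjust_level` and the idea of counting misses over a measurement window are assumptions; the rates of one miss per second and one miss per 2 seconds are the example values from the text.

```python
def adjust_level(level, missing_count, window_seconds,
                 raise_rate=1.0, lower_rate=0.5, min_level=1, max_level=8):
    """Return the new resilience level after one measurement window.

    missing_count  -- reference slices found missing during the window
    window_seconds -- length of the measurement window
    raise_rate     -- misses per second above which resilience is increased
    lower_rate     -- misses per second below which resilience is reduced
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rate = missing_count / window_seconds
    if rate > raise_rate:
        return min(level + 1, max_level)
    if rate < lower_rate:
        return max(level - 1, min_level)
    # Between the two thresholds the level is left unchanged.
    return level


level_after_burst = adjust_level(3, missing_count=5, window_seconds=4.0)
```

Leaving the level unchanged between the two thresholds keeps the encoder from oscillating when the miss rate hovers near a single cut-off.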

The quality of the image displayed at a receiver can be improved by weighting sections of an image frame, so that the face of a person communicating via the videophone will be emphasized over the less-important background. Areas where there is motion may be emphasized over static parts of the image, or image recognition techniques may be employed to emphasize specific features, such as a face, or parts of the face such as the eyes or mouth. Once areas of importance are identified, the quantization factor of the image (how much information is retained for transmittal, versus how much is discarded) can be adjusted, so that the quality of the image at the receiver is maximized.
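One way to picture the weighting is a per-block quantizer map derived from an importance map. The sketch below is illustrative only: the function `quantizer_map`, the base value of 30, the swing of 8, and the 1 to 51 clamp are all assumed values, and how the importance weights are obtained (motion detection, face detection) is outside its scope.

```python
def quantizer_map(importance, base_q=30, max_delta=8, q_min=1, q_max=51):
    """Turn per-block importance weights into per-block quantizer values.

    importance -- 2-D list of weights in [0, 1]; 1.0 marks the most important
                  blocks (for example a face), 0.0 the least (background)
    Lower quantizer values retain more information for that block.
    """
    result = []
    for row in importance:
        out_row = []
        for weight in row:
            if not 0.0 <= weight <= 1.0:
                raise ValueError("importance weights must be between 0 and 1")
            # weight 1.0 -> base_q - max_delta, weight 0.0 -> base_q + max_delta
            q = base_q + round(max_delta * (1.0 - 2.0 * weight))
            out_row.append(max(q_min, min(q_max, q)))
        result.append(out_row)
    return result


# Background, neutral, and face blocks in a single row.
row_q = quantizer_map([[0.0, 0.5, 1.0]])
```

Here the face block gets the finest quantization and the background the coarsest, so bits are spent where the viewer is most likely to look.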

It is noted that a single videophone can be used as a receiver, to receive data from a source such as the server 110 (FIG. 6), which could be a weather channel, traffic report, or the like, and which would periodically transmit an I-slice to ensure error resiliency.

When two videophones are communicating on a network, it is not necessary that they operate at the same resolution, or bit rate; they can operate asymmetrically, with each adapting to errors occurring in the connecting network. Each videophone will operate at a bit rate that provides the best connection, and this may depend, in part, on the network connections for each; for example, a particular phone may have a connection which enables it to receive at a higher bandwidth than is available for sending.

Although the invention has been described in terms of preferred embodiments, it will be apparent that numerous variations and modifications may be made without departing from the true spirit and scope thereof, as set out in the accompanying claims.

1. A video telephone system, comprising: a first videophone having a first camera connected to a first encoder and a first transmitter, and having a first decoder connected to a first display; the first videophone in communication with a second videophone, the second videophone having a second camera connected to a second encoder and a second transmitter, and having a second decoder connected to a second display; the first encoder and first decoder of the first videophone and the second encoder and second decoder of the second videophone being interconnected by a network for video communication, wherein said first and second encoders are adaptively responsive to network-caused errors in said communication to provide real-time error recovery, wherein each of the first and second encoders is responsive to image video signals from its corresponding camera to encode data packets representing an image corresponding to the image video signals, wherein each of the first and second encoders produce additional data packets that represent only the differences between a first reference image frame and subsequent image frames, wherein, when an error produces an incorrectly decoded reference image frame, the first and/or second encoder will encode said subsequent image frames based upon previously known, correctly decoded frames.
 2. The system of claim 1, wherein said encoders for said videophones transmit compressed audio and video signals through said network.
 3. The system of claim 2, wherein said audio and video signals are synchronized.
 4. The system of claim 1, wherein said data packets representing an image corresponding to the image video signals is dependent on data representing a prior image.
 5. The system of claim 4, wherein the first decoder is in communication with the second encoder; and wherein the second encoder includes means responsive to changes in conditions in said network to adapt the second encoder to such changes.
 6. The system of claim 5, wherein said means responsive to changes in conditions in said network includes at least one means to change the bandwidth requirement that is selected from the group consisting of: modifying a frame resolution of transmitted video images, reducing a packet rate of transmitted encoded video image data, or changing said network.
 7. The system of claim 1, wherein the first data packet is a reference for each subsequent data packet; and wherein reference data packets are produced at a predetermined frequency.
 8. The system of claim 7, wherein the frequency at which the reference data packets are produced by an encoder is adaptively varied by the encoder in response to said network-caused errors.
 9. The system of claim 1, wherein first and second transceivers are video telephones.
 10. The system of claim 1, wherein said network produces errors in video communication due to data corruption or loss.
 11. The system of claim 10, wherein said adaptively responsive first and second encoders respond in real time to data corruption or loss in said network to provide said error recovery.
 12. The system of claim 11, wherein said error recovery is based on previously-transmitted reference data.
 13. The system of claim 1, further including means for storing or displaying received video in each of first and second videophones.
 14. A method of high resolution video communication, comprising: transmitting compressed video data signals representing an image through a network; detecting network-caused errors in said video data signals; adaptively changing said video data signals in real time to provide error recovery; and storing or displaying said video image, wherein adaptively changing said video data signals packets in real time comprises encoding a subsequent packet of video data signals to produce a signal packet representing the difference between a subsequent frame and a reference frame packet upon detection of network-caused errors, wherein the reference frame packet was incorrectly decoded or the reference frame packet was generated from a previous packet that was not received or incorrectly decoded.
 15. The method of claim 14, wherein transmitting compressed video signals includes: obtaining a series of video image frames; encoding a first frame to generate a first packet of video data signals; encoding subsequent frames to generate corresponding subsequent video data packet signals; and supplying said data packet signals sequentially to said network.
 16. The method of claim 14, further includes producing multiple frame reference packets for a set of subsequent packets.
 17. The method of claim 14, further includes producing a reference data packet for a set of subsequent frame data packets, wherein providing error recovery includes varying the frequency at which the reference data packets are produced.